This tutorial was created for a 3-day free R introduction workshop organized by the Laforest-Lapointe Lab at Université de Sherbrooke on May 12th-14th.

1 Session 1 - Monday AM

R is an object-oriented programming language. Essentially, this means that data, values, inputs and outputs are saved as “objects” whose names are defined by the user. Objects can then be used to perform calculations, data manipulation, or visualisations.

Objectives:

  1. Familiarise ourselves with RStudio
  2. Learn the syntax of the R language
  3. Learn to work with objects

1.1 RStudio Environment

RStudio is a program that allows you to write a series of commands as a script, have R execute those commands, visualise the output, and see the objects that you create.

When you open RStudio for the first time, you will have three panels.

Top left: Script

Document where you will write the command lines. To create a script: File > New File > R Script. Save regularly!

To execute a command, place your cursor on the command’s line and press: Command + Enter for Mac, Control + Enter for Windows.

Bottom left: Console

Where the output of the command line will appear. When you execute commands from the script, the command and its output (or error messages!) will appear there.

Use it to execute commands that you do not want to save in the script or to test your commands before adding them to the script. To execute a command from the console, simply press enter after writing it.

Top right: Environment

Place where all the objects you create (dataframes, values, vectors) are shown. Click on any object to visualise it.

Bottom right: Files, Plots, Help

This is a files and plots viewer as well as the help center.

1.2 Syntax

1.2.1 Basic R usage

The very simplest thing you can do is use R as a calculator and get the output from the console:

# Everything written after a hashtag is a comment (cannot be executed)
2+4
## [1] 6

But we like to do more exciting things. Generally, we save values in objects and then execute operations on those objects.

Most R operations are written like this:

an_object <- a_value

This stores a value in an object named an_object. The value can be, for example, a single number, a character string, or a complex object like a matrix or a list. It can be user-written (“hardcoded”) like above, or be the output of a function, such as:

an_object <- a_function(values, some_parameters)

In this case, the function takes values as an input, applies operations to it, and spits out new values. Parameters are generally optional ways to control the behaviour of the function. Concrete examples will be presented in the next few sections.

# Store the output in an object
res <- 2+4
res
## [1] 6
# Multiply the result 
2*res
## [1] 12

1.2.2 Useful symbols

R uses many characters that have a specific role. The essential ones are presented in this section.

  • $ access an element of an object by its name (e.g. the column of a data frame)
  • [] access an element of an object by its index (e.g. vector[3] accesses the 3rd element in that vector)
  • <- assign an object (shortcut: alt + - on Windows, option+ - on mac)
  • = assign an object; specify function parameters
  • "" and '' are used to express strings (character values)
  • ? used to learn how a function works. ?sum will open a help file where you can learn how to use that function.
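A quick sketch showing these symbols in action, using a small made-up data frame:

```r
# A small data frame invented for this example
df <- data.frame(species = c("oak", "pine", "birch"),
                 height = c(21, 30, 18))

df$height           # $ fetches the 'height' column by name
df$height[2]        # [] fetches the 2nd element of that vector: 30
x <- df$height[2]   # <- stores that value in a new object named x
mean(x = df$height) # = specifies a function parameter (here, the input x)
"oak" == 'oak'      # "" and '' both create character strings: TRUE
# ?mean             # would open the help page for mean()
```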

1.2.3 Comparison operators

To compare values to one another:

  • == equal
  • != is not equal

The following can only be used for numeric values:

  • < smaller than
  • > greater than
  • <= smaller or equal to
  • >= greater or equal to
5 > 6
## [1] FALSE
5 <= 6
## [1] TRUE

TRUE and FALSE are logical (boolean) values with a specific meaning: they represent a type of information that can only be true or false.

5 != 4+1
## [1] FALSE
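A handy property of logical values: R treats TRUE as 1 and FALSE as 0, so you can count or average them. A small sketch:

```r
measures <- c(3, 7, 12, 5, 9)
measures > 6       # a logical vector: FALSE TRUE TRUE FALSE TRUE
sum(measures > 6)  # counts the TRUEs: 3 values are greater than 6
mean(measures > 6) # proportion of TRUEs: 0.6
```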

1.2.4 Logical operators

  • & and
  • | or

They are used to test multiple conditions at once:

(5 == 5) & (6 == 6)
## [1] TRUE
(5 == 4) & (6 == 6)
## [1] FALSE
(5 == 4) | (6 == 6)
## [1] TRUE

1.3 Object classes

Objects and object components (e.g., data frame variables) always have a class. The class gives you a hint about how the object will behave.

There exist many object classes, but a few are much more common than others. In the str() example below, we can see num for numeric vectors; factor is a special (composite) class for categorical data. Other common classes are character (chr), integer (int) and logical (logi).

We can also see a composite class, data.frame. These are used for more complex objects, usually with two dimensions. Another composite class is matrix, which is similar to a data frame (the difference between the two is that a data frame can contain columns of various classes, while a matrix cannot). There are also list and array, which won’t be covered in this course.

  • Character
  • Numeric
  • Integer
  • Factor
  • Logical
  • Date
  • Vector
  • Data frame
  • Matrix
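To illustrate the difference between a matrix and a data frame mentioned above, a small sketch: a matrix forces every element into a single class, whereas a data frame keeps one class per column.

```r
# Mixing numbers and strings in a matrix: everything is coerced to character
m <- cbind(c(1, 3), c("a", "b"))
class(m[1, 1]) # "character": the numbers became strings

# A data frame keeps one class per column
d <- data.frame(num = c(1, 3), chr = c("a", "b"))
class(d$num)   # "numeric"
class(d$chr)   # "character"
```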

Let’s create a few objects and look at the class R will assign them by default. There are a few basic functions worth learning about at this point. Functions are little programs that do specific things and you’ll be using them all the time. The most basic function is c() which is used to combine its arguments into a vector. If all arguments are numeric, the vector will be numeric. If they contain at least one string element (wrapped in "quotes"), it will be a character vector.

# Create a vector of numeric values
vec1 <- c(1,3,5)

# Create a vector of characters values
vec2 <- c("one","three","five")

# Create a vector of logical values
vec3 <- c(TRUE, TRUE, FALSE)

# Bind these vectors into a dataframe
df1 <- data.frame(vec1, vec2, vec3)
class(vec1) # Use function class( ) to get the class of an object
## [1] "numeric"
class(vec2)
## [1] "character"
# Change the class of vec2 to factor
vec2 <- as.factor(vec2) 
vec2
## [1] one   three five 
## Levels: five one three
class(vec2)
## [1] "factor"
class(df1)
## [1] "data.frame"
# Look at the structure of the data frame we made:
str(df1)
## 'data.frame':    3 obs. of  3 variables:
##  $ vec1: num  1 3 5
##  $ vec2: chr  "one" "three" "five"
##  $ vec3: logi  TRUE TRUE FALSE

1.4 Save an object

# Save object
saveRDS(df1, "df1.RDS")
# Read/open object
df1_new <- readRDS("df1.RDS")

This is very useful when you need to share R objects as is with collaborators. It’s also useful when you work with large datasets and want to save time. For example, you could have a first script dedicated to producing a certain output, which you save as an RDS object. Then, in a second script, this data can be taken as input and be worked on. This way, when working on your second script, you don’t need to regenerate this large input dataset every time you test the code on that second script.

1.5 Challenge N˚1

Create the following objects with 4 made-up values each:

  • sample_id: a character vector
  • CFU: a numeric vector
  • Control: a logical (boolean) vector telling us which samples are controls
  • Date: a vector of dates with format “YYYY-MM-DD”

Then,

  1. combine these into a data frame
  2. using str(), check the class of your data frame columns
  3. Save your dataframe as an RDS file
  4. Execute your script from top to bottom and save it

1.6 Getting started with data

Now let’s import a small dataset and familiarise ourselves with a few functions that will help us look at our data in more detail.

1.6.1 Working directory

It’s good to have a working directory specifically dedicated to your project, where you store your data, scripts, output, etc. This way you can organise what goes in and comes out of R and keep track of your various scripts.

R can look anywhere in our computer for files. However, to load a file, you need to know its full path, e.g. '/Users/jorondo/Repos/RworkshopUdS_2025/data.txt' which can be annoying to write out. Instead, it’s useful to set your working directory once at the beginning of your script, and then use relative paths to find your file. To do this, start by creating a folder in your computer, from where to work:

setwd('~/RWorkshop2025')
dir.create('output')
dir.create('data')

The ~ is a shortcut to your Home folder. You can always create your folders by hand, but it helps to have the path in your script. Now, if we save an object we previously created,

saveRDS(my_first_data_frame, "output/my_first_data_frame.RDS")

It will be saved in that directory without needing to specify the absolute path.
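To check where R is currently looking, and what files it can see there, getwd() and list.files() are handy (their output will of course depend on your own machine):

```r
getwd()      # prints the current working directory
list.files() # lists the files and folders it contains
file.exists('output/my_first_data_frame.RDS') # TRUE once the object is saved there
```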

1.6.2 Data inspection

In any project, a first step should always be to look at your data as it appears in R. Many functions exist to help you do that. summary() will summarise variables in a class-dependent manner: numeric variables get a numeric summary (median, range, etc.), factors and logicals get a count per category, and character variables only show their length, class and mode.

assay <- read.table("data/assay.txt", header = TRUE)
summary(assay)
##       CFU         Antibiotic              OD       
##  Min.   : 4.20   Length:60          Min.   :0.500  
##  1st Qu.:13.07   Class :character   1st Qu.:0.500  
##  Median :19.25   Mode  :character   Median :1.000  
##  Mean   :18.81                      Mean   :1.167  
##  3rd Qu.:25.27                      3rd Qu.:2.000  
##  Max.   :33.90                      Max.   :2.000

Other useful functions are head(), which shows the first rows of an object, and View(), which displays the object in a spreadsheet-like viewer, a bit like an Excel sheet.

head(assay)
##    CFU   Antibiotic  OD
## 1  4.2 streptomycin 0.5
## 2 11.5 streptomycin 0.5
## 3  7.3 streptomycin 0.5
## 4  5.8 streptomycin 0.5
## 5  6.4 streptomycin 0.5
## 6 10.0 streptomycin 0.5

Other useful checks are dim() or length(), especially when playing with functions that modify our objects:

dim(assay) # shows the dimensions of matrices and data frames
## [1] 60  3
ncol(assay) # number of columns
## [1] 3
colnames(assay) # column names (our variable names!)
## [1] "CFU"        "Antibiotic" "OD"
length(assay$OD) # check the length of a vector (a column fetched using $ is a vector)
## [1] 60

1.6.3 Basic functions

R can be used to operate on values in pretty much any way imaginable, from very simple things like calculating the mean to more complex things like removing certain characters from every value in a data frame variable.

Most mathematical operations are available as functions:

some_numbers <- head(assay$CFU, n = 10) # n is a parameter of the head() function
some_numbers
##  [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0
sqrt(some_numbers) # square root of each value
##  [1] 2.049390 3.391165 2.701851 2.408319 2.529822 3.162278 3.346640 3.346640
##  [9] 2.280351 2.645751
a_sum_value <- sum(some_numbers) # Sum the values of a numeric vector
a_length_value <- length(some_numbers) # Count the number of elements in that vector
a_sum_value; a_length_value  # print them to the console
## [1] 79.8
## [1] 10
values_mean <- a_sum_value/a_length_value # compute the mean "manually"
values_mean
## [1] 7.98
mean(some_numbers) # compute the mean using a specific function
## [1] 7.98
median(some_numbers)
## [1] 7.15
sd(some_numbers) # compute the standard deviation
## [1] 2.746634
min(some_numbers) # find the minimum
## [1] 4.2

Operations can also be done between columns of a data frame. In the following, we multiply two columns together and add the resulting vector as a new column on the same dataset using the <- (assignment) operator.

assay$CFU_times_OD <- assay$CFU * assay$OD # with $ sign
assay['CFU_times_OD'] <- assay$CFU * assay$OD # equivalent!
assay['log_CFU'] <- log10(assay$CFU)
summary(assay) 
##       CFU         Antibiotic              OD         CFU_times_OD   
##  Min.   : 4.20   Length:60          Min.   :0.500   Min.   : 2.100  
##  1st Qu.:13.07   Class :character   1st Qu.:0.500   1st Qu.: 6.875  
##  Median :19.25   Mode  :character   Median :1.000   Median :19.250  
##  Mean   :18.81                      Mean   :1.167   Mean   :25.746  
##  3rd Qu.:25.27                      3rd Qu.:2.000   3rd Qu.:46.750  
##  Max.   :33.90                      Max.   :2.000   Max.   :67.800  
##     log_CFU      
##  Min.   :0.6232  
##  1st Qu.:1.1153  
##  Median :1.2843  
##  Mean   :1.2287  
##  3rd Qu.:1.4027  
##  Max.   :1.5302

1.7 Loading packages

As you can see, Base R has hundreds of functions available. But sometimes, we need more. That’s when packages come in handy. Packages are collections of functions developed by people around the world for various purposes. These are made available freely to the public, and R has a function to install these packages. For example, you could run install.packages('tidyverse'), which we’ll use tomorrow. Now, whenever you use R, you can gain access to hundreds of new functions by adding library(tidyverse) at the beginning of your script.

Often, you’ll need multiple packages for your project. Instead of writing a series of library() commands at the top of your script, you can install the pacman package and use its very useful function, p_load(). This command installs AND loads any package, and you can load as many packages as you want with a single command.

Remember: every time you open RStudio, you’ll need to load the packages you need. A good practice is to put the package loading commands at the top of all your scripts.
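As a sketch, the top of such a script could look like this (the package names are only examples):

```r
# install.packages('pacman') # only needed once per computer
library(pacman)
p_load(readxl, tidyverse) # installs each package if missing, then loads it
```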

1.8 Challenge N˚2

  1. Install the pacman package
  2. Download the dataset Rocio_tomato_modified.xlsx available on the Workshop repository or on the Teams documents
  3. Put this file in your working directory
  4. Create a new script and save it as Challenge2.R
  5. In this script:
       • load the pacman package and use p_load() to install and load the package readxl, which we’ll need for the next tutorial
       • use read_xlsx() to read the file into an object
       • print a summary of that object
       • compute the mean of the chlorotic.area column and print it
  6. Run your script top-to-bottom to make sure it works entirely
  7. Add comments to your code, briefly explaining what it does

1.9 Other useful functions

1.9.1 Use R as a calculator

As briefly shown at the beginning, you can also use R directly as you would a calculator. It’s useful for complex arithmetic because you can write your whole equation and then execute it. For example, let’s find the area of a single slice of a 14” pizza cut in 8, then format the result to be more “readable”:

diameter <- 14
slices <- 8
slice_area <- (pi*(diameter/2)^2)/slices
slice_area_formatted <- round(slice_area, 2)

message('The slice of a 14 inches pizza cut in 8 will have a surface of ', slice_area_formatted, " square inches.")
## The slice of a 14 inches pizza cut in 8 will have a surface of 19.24 square inches.

Notice that we’ve created many objects to do this. We could just as well have done a single-liner that only prints the result to our console:

paste(
  'The slice of a 14 inches pizza cut in 8 will have a surface of',
  round((pi*(14/2)^2)/8, 2), 
  "square inches."
) # notice that I can split my command on multiple lines without affecting it!
## [1] "The slice of a 14 inches pizza cut in 8 will have a surface of 19.24 square inches."

but it’s up to you to decide how explicit you want your code to be. More explicit code is more readable, but requires more steps to be written out. It’s all about balance.

1.9.2 Random number generation

You can also use R as a random number generator. The simplest way is using the uniform distribution, runif(), which draws n observations between the specified parameters min and max.

a_vector_of_values <- runif(n = 10, min = 0, max = 100)
a_vector_of_values
##  [1] 97.24192 31.69451 26.69018 34.24109 51.03781 47.47513 17.20659 68.85718
##  [9] 85.22271 28.45370
sort(a_vector_of_values)
##  [1] 17.20659 26.69018 28.45370 31.69451 34.24109 47.47513 51.03781 68.85718
##  [9] 85.22271 97.24192
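Each call to runif() gives different numbers. If you need the same “random” numbers every run (e.g., for a reproducible example), you can fix the generator’s starting state with set.seed():

```r
set.seed(42) # fix the random number generator's starting state
x1 <- runif(n = 3, min = 0, max = 100)
set.seed(42) # reset it to the same state
x2 <- runif(n = 3, min = 0, max = 100)
identical(x1, x2) # TRUE: same seed, same draws
```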

2 Session 2 - Monday PM

Objectives:

  1. Find errors in data and correct them
  2. Learn quick data visualisation techniques
  3. Learn about loops

Let’s look at a new, imperfect dataset, which is most likely what you will be dealing with. We thank workshop participant Maria Rocio Gonzalez-Lamothe for sharing it with us. This data comes from an experiment with tomato plants, which have been infected (or not) with a pathogen and have either received (or not) a treatment. The data collected are the total area of each plant leaf and the total area affected by chlorosis.

2.1 Reading from an Excel file

Earlier we used a dataset that was saved in a text file with columns delimited by tabulations (read with the read.table() function). However, many of you save your data in Excel files.

It is important to give objects explicit, short names that are not already built-in functions in R. For example, do not name an object data, summary or plot.

Let’s load a new dataset to work with !

library(pacman)
p_load(readxl)
tomato <- read_xlsx('data/Rocio_tomato_modified.xlsx', sheet = 2)
#tomato<-read.csv("data/Rocio_tomato_modified.csv")

Let’s look at this dataset.

head(tomato)
## # A tibble: 6 × 5
##     rep Treatement Infection total.area chlorotic.area
##   <dbl> <chr>      <chr>          <dbl>          <dbl>
## 1     1 mock       water        2663192          83409
## 2     1 mock       water        3466704         111165
## 3     1 mock       water        3604993          52202
## 4     1 treated    water        3605450          86927
## 5     1 treated    water        2323251          46141
## 6     1 treated    water        2938431          40150
summary(tomato)
##       rep        Treatement         Infection           total.area     
##  Min.   :1.00   Length:48          Length:48          Min.   :1034441  
##  1st Qu.:1.75   Class :character   Class :character   1st Qu.:2194426  
##  Median :2.50   Mode  :character   Mode  :character   Median :2856845  
##  Mean   :2.50                                         Mean   :2646678  
##  3rd Qu.:3.25                                         3rd Qu.:3291366  
##  Max.   :4.00                                         Max.   :4389281  
##  chlorotic.area  
##  Min.   : 14666  
##  1st Qu.: 31068  
##  Median : 54448  
##  Mean   : 81711  
##  3rd Qu.: 87013  
##  Max.   :365390

2.2 Formatting

Next, we want to make sure that variables are formatted adequately. Here, we have variables that are numeric (allowing mathematical operations) and some that are categorical (factors, with levels). We will transform the latter into factors and take a look at summary() to confirm the successful transformation.

tomato$rep<-as.factor(tomato$rep)
tomato$Treatement<-as.factor(tomato$Treatement)
tomato$Infection<-as.factor(tomato$Infection)
summary(tomato)
##  rep      Treatement    Infection    total.area      chlorotic.area  
##  1:12   mock   :24   pathogen:23   Min.   :1034441   Min.   : 14666  
##  2:12   treated:24   pathogn : 1   1st Qu.:2194426   1st Qu.: 31068  
##  3:12                water   :24   Median :2856845   Median : 54448  
##  4:12                              Mean   :2646678   Mean   : 81711  
##                                    3rd Qu.:3291366   3rd Qu.: 87013  
##                                    Max.   :4389281   Max.   :365390

We can quickly see that a typo is present in our data. We could either correct it in the raw data file and reimport the object, or, to increase reproducibility, correct it in R using the which() function and the [] indexing operator.

which(tomato$Infection=="pathogn") #find the position of the error in the variable
## [1] 10
tomato$Infection[10] #confirm that this is the error
## [1] pathogn
## Levels: pathogen pathogn water
tomato$Infection[10]<-"pathogen" #correct the error FOREVER by overwriting
tomato$Infection<-droplevels(tomato$Infection) #erase the empty factor level
summary(tomato) #check the summary again
##  rep      Treatement    Infection    total.area      chlorotic.area  
##  1:12   mock   :24   pathogen:24   Min.   :1034441   Min.   : 14666  
##  2:12   treated:24   water   :24   1st Qu.:2194426   1st Qu.: 31068  
##  3:12                              Median :2856845   Median : 54448  
##  4:12                              Mean   :2646678   Mean   : 81711  
##                                    3rd Qu.:3291366   3rd Qu.: 87013  
##                                    Max.   :4389281   Max.   :365390

2.3 Visualizing the data

Histograms are a fundamental tool in R for exploratory data analysis (EDA). They help you:

Visualize Distributions: Quickly see the shape of your data (e.g., normal, skewed, bimodal).

Spot Outliers & Gaps: Identify unusual values or missing ranges in your dataset.

Check Assumptions: Assess if data meets statistical assumptions (e.g., normality for t-tests).

Compare Groups: Overlay histograms to compare different subsets (e.g., treatment vs. control).

2.3.1 Histograms

In R, creating a histogram is simple (e.g., hist(data)), making it an essential first step in understanding your data before diving into analysis.

hist(tomato$total.area)

par(mfrow=c(1,2))# Create a 1 x 2 plotting matrix
hist(tomato$total.area)
hist(tomato$chlorotic.area)

2.3.2 Boxplots

Scatterplots and boxplots are also very common ways to look at the distribution of your variables. Here are a few examples:

par(mfrow=c(2,2))# Create a 2 x 2 plotting matrix
# The next 4 plots created will be plotted next to each other
plot(tomato$total.area~tomato$Treatement)
plot(total.area~Treatement, data=tomato) #this is the same but the XY labels change
plot(total.area~Infection, data=tomato)
plot(chlorotic.area~Infection, data=tomato)

What happens with an interaction?

plot(total.area~Infection*Treatement, data=tomato)

2.3.3 Scatterplots

They allow you to visualise all the data points:

plot(total.area~chlorotic.area, data=tomato)

2.3.4 Transform data directly in plots

You can apply transformation in your plot formula:

par(mfrow=c(1,2))
plot(log(total.area)~Infection,data=tomato)
plot(I(total.area/100000)~Infection,data=tomato)

We can create new variables that combine Treatement and Infection as well as the ratio between both areas, then use it for plotting:

tomato$combo<-as.factor(paste(tomato$Treatement,tomato$Infection,sep=""))
tomato$rate<-tomato$chlorotic.area/tomato$total.area
tomato$combo<-factor(tomato$combo,levels=levels(tomato$combo)[c(2,1,4,3)]) #here we reorder the levels
plot(rate~combo,data=tomato)

2.4 Challenge N˚3

Create a new dataset from tomato but excluding rep #5.

Plot side-by-side the histogram of total.area with and without rep 5, as well as chlorotic.area with and without rep 5.

BONUS Create a dataset for each rep and plot the ratio chlorotic.area/total.area for the four combined levels (combo).

2.5 Loops

Loops are very useful, as they help us do similar operations multiple times while writing them only once. Let’s say you want to compute the mean chlorotic area for treated and untreated plants.

mean_Mock <- mean(tomato$chlorotic.area[tomato$Treatement == "mock"])

mean_Treat <- mean(tomato$chlorotic.area[tomato$Treatement == "treated"])

cat("Mean chlorotic area (mock):", mean_Mock, "square mm" )
## Mean chlorotic area (mock): 104751.8 square mm
cat("Mean chlorotic area (treated):", mean_Treat , "square mm")
## Mean chlorotic area (treated): 58670.88 square mm

and so on for every variable. Using a loop, it would look like this:

for (i in c('mock', 'treated')) {
  mean_value <- mean(tomato$chlorotic.area[tomato$Treatement == i])
  cat("Mean chlorotic area (",i,"):", mean_value, "square mm\n")
} # the \n character forces a line break
## Mean chlorotic area ( mock ): 104751.8 square mm
## Mean chlorotic area ( treated ): 58670.88 square mm

It may not seem like much of a gain, but imagine if you had many more levels to compute. Without a loop, you need two extra lines of code for every level; with a loop, you only need to add values to your vector.
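One refinement worth knowing: instead of typing the levels by hand, you can loop over whatever values actually occur in the column using unique(). A sketch with a small made-up data frame, so it runs on its own:

```r
# A made-up data frame standing in for 'tomato'
df <- data.frame(Treatement = rep(c("mock", "treated", "other"), each = 2),
                 chlorotic.area = c(10, 20, 30, 40, 50, 60))

# unique() returns every level present in the column, so nothing is hardcoded
for (i in unique(df$Treatement)) {
  mean_value <- mean(df$chlorotic.area[df$Treatement == i])
  cat("Mean chlorotic area (", i, "):", mean_value, "square mm\n")
}
```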

2.6 Challenge N˚4

  1. Compute the % chlorosis by total area for each sample as a new column;
  2. Using a loop, compute the mean % chlorosis for every unique combination of Treatment, Infection and rep and print it to the console using cat() ;
  3. Rerun your loop, but this time instead of printing to the console, save the values in a matrix. Hint: you can create an empty matrix with the appropriate dimensions and fill it using the loop and row/columns indices.

3 Session 3 - Tuesday AM

Objectives:

  1. Learn the tidyverse syntax as an intuitive way to prepare our datasets.

  2. Learn to use the ggplot2 package!

install.packages('tidyverse') # super-package that includes ggplot

3.1 The tidyverse syntax

When working with large datasets with many variables, base R can become cumbersome. Imagine having to change 20 character variables to factors, or fix a recurring typo in a data frame with 2,000 rows… fortunately, this happened to enough people that the community created new functions to help you write concise, powerful scripts.

Yesterday, we covered functions that are included “by default” in R, what we call the Base R language. Over time, a large library of functions was developed, called tidyverse. These functions let you modify your data in a very intuitive manner.

Let’s learn the tidyverse syntax!

3.1.1 Understanding pipes %>%

We introduce a modern syntax from the tidyverse superpackage, which uses the pipe operator %>%. You can print it by using the shift+cmd+m keyboard combination (or shift+ctrl+m on Windows).

To show how much easier it makes code to read, consider the following challenge: compute the mean proportion of chlorotic area of infected plants that were NOT treated, and format it to print as a percentage with 2 decimals.

A traditional base R code could look like this:

which_observations <- which(tomato$Infection == 'pathogen' & tomato$Treatement == 'mock')
inf_treat_chlor_area <- tomato$chlorotic.area[which_observations]
inf_treat_total_area <- tomato$total.area[which_observations]
prop_area <- 100*inf_treat_chlor_area/inf_treat_total_area

paste('Mean proportion:', round(mean(prop_area),2), "%")
## [1] "Mean proportion: 5.97 %"

Now compare the same thing with the tidyverse syntax. We will use 3 new functions, namely:

  • mutate() creates new variables from existing ones
  • filter() subsets our rows based on the values of variables
  • pull() returns a dataset column as a vector

We’ll also use the pipe operator %>%, which is also available in base R as |> (choose your favourite!).

library(tidyverse)

tomato_subset <- tomato %>% 
  filter(Infection == 'pathogen' & Treatement == 'mock') %>% 
  mutate(prop_area = 100*chlorotic.area/total.area) 

# Compute the mean
tomato_subset %>% 
  pull(prop_area) %>% 
  mean() %>% 
  round(2) %>% 
  paste("Mean proportion:", ., "%")
## [1] "Mean proportion: 5.97 %"

Let’s switch to a new dataset with thousands of observations. We’ll look at it briefly, then select a few columns of interest using select() and modify variable classes using mutate().

kangoroos <- read_xlsx('data/Marco_kangoroos.xlsx') 
## Warning: Expecting date in L1160 / R1160C12: got 'NA'
## Warning: Coercing text to numeric in Q1411 / R1411C17: '1.5'
## Warning: Coercing text to numeric in R1750 / R1750C18: '91.7'
## Warning: Expecting date in L2453 / R2453C12: got '20;08'
glimpse(kangoroos)
## Rows: 3,774
## Columns: 24
## $ ID              <dbl> 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 4, 5, 6, 6, 6, 6, 6, 6, …
## $ Sex             <chr> "M", "M", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ Age             <chr> "6", "8", "16", "0", "4", "5", "6", "7", "9", "10", "A…
## $ `Age certainty` <dbl> 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 0, 1, 4, 4, 4, 4, 4, 4, …
## $ Date            <chr> "Jul 27 08", "Nov 30 10", "Jul 27 08", "Jul 27 08", "S…
## $ No.             <dbl> 1, 2, 1, 1, 2, 3, 4, 5, 6, 7, 1, 1, 1, 2, 3, 4, 5, 6, …
## $ East            <chr> "7945", "8011", "8273", "8273", "8301", "8405", "8388"…
## $ North           <chr> "8411", "8325", "8362", "8362", "8163", "8337", "8431"…
## $ Method          <chr> "Crossbow", "Pole", "Crossbow", "PY", "T-pole", "T-pol…
## $ Year            <dbl> 2008, 2010, 2008, 2008, 2012, 2013, 2014, 2015, 2017, …
## $ Day             <dbl> 209, 334, 209, 209, 253, 246, 253, 295, 340, 255, 225,…
## $ Time            <dttm> NA, NA, NA, NA, NA, NA, 1904-01-01 07:45:00, 1904-01-…
## $ Mass            <dbl> 41.00, 46.25, 28.50, NA, 23.25, 26.00, 28.25, 28.50, 2…
## $ Foot            <dbl> 355.0, 367.0, 314.0, 174.0, 316.0, 317.0, 322.0, 323.0…
## $ Leg             <dbl> 595.0, 613.0, 515.0, 212.0, 502.0, 509.0, 517.0, 528.0…
## $ Arm             <dbl> 295, 316, NA, NA, 198, 212, 214, 222, 229, 225, 220, N…
## $ Teeth           <dbl> 3.0, 0.5, 3.0, NA, 2.0, 2.0, 1.5, 2.0, 1.5, 1.0, 1.0, …
## $ `Head length`   <dbl> NA, NA, NA, 99.4, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Zygo            <chr> NA, NA, NA, NA, "89.7", "88.1", "92.5", "95.2", "95", …
## $ Notes           <chr> NA, NA, "mother of #3", "mother is #2", NA, "Mother of…
## $ Measurer        <dbl> 2, 1, 2, 2, 1, 1, 1, 8, 9, 1, 2, 2, 2, 1, 1, 3, 1, 1, …
## $ `new date`      <dttm> 2008-07-27, 2010-11-30, 2008-07-27, 2008-07-27, 2012-…
## $ `new birthdate` <dttm> NA, NA, NA, 2007-12-20, 2007-12-20, 2007-12-20, 2007-…
## $ `age months`    <dbl> NA, NA, NA, 7.236842, 56.743421, 68.552632, 80.789474,…
kangoroos <- kangoroos %>% 
  select(ID, Sex, Age, Mass, Leg, Date, Teeth) %>% 
  mutate(ID = as.factor(ID),
         Sex = as.factor(Sex),
         Age = as.numeric(Age)) 
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion

These warnings are suspicious! We’ll investigate them.

We can take advantage of this syntax to create more informative summaries of our data. group_by() lets us operate on subsets of our data without actually subsetting it; passing the grouped data frame to summarise() then applies mathematical functions to every group.

glimpse(kangoroos)
## Rows: 3,774
## Columns: 7
## $ ID    <fct> 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7…
## $ Sex   <fct> M, M, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F…
## $ Age   <dbl> 6, 8, 16, 0, 4, 5, 6, 7, 9, 10, NA, 0, 12, 13, 14, 15, 16, 17, 1…
## $ Mass  <dbl> 41.00, 46.25, 28.50, NA, 23.25, 26.00, 28.25, 28.50, 29.50, 27.7…
## $ Leg   <dbl> 595.0, 613.0, 515.0, 212.0, 502.0, 509.0, 517.0, 528.0, 533.0, 5…
## $ Date  <chr> "Jul 27 08", "Nov 30 10", "Jul 27 08", "Jul 27 08", "Sept 9 12",…
## $ Teeth <dbl> 3.0, 0.5, 3.0, NA, 2.0, 2.0, 1.5, 2.0, 1.5, 1.0, 1.0, NA, 2.0, 2…
kang_group_summary <- kangoroos %>% 
  drop_na(Age, Sex) %>% 
  group_by(Sex) %>% 
  summarise(mean_age = mean(Age), 
            sd_age = sd(Age),
            n_obs = n()
            )

kang_group_summary
## # A tibble: 2 × 4
##   Sex   mean_age sd_age n_obs
##   <fct>    <dbl>  <dbl> <int>
## 1 F         4.99   4.55  1900
## 2 M         2.87   3.91  1107

Doing this using base R syntax would have required many intermediate objects, whereas here we get the end product with a single pipe that does all we need.

We can also format this table if we wanted to export it. The kableExtra package is quite useful for this. We first round the values to the second decimal using mutate(), then change the column names and style the table.

library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
kang_group_summary %>% 
  mutate(across(where(is.numeric), ~ round(., 1))) %>% 
  kable(col.names = c("Sex", "Mean Age", "SD Age", "N")) %>% 
  kable_styling("striped")
Sex Mean Age SD Age N
F 5.0 4.5 1900
M 2.9 3.9 1107

Here’s a fun operator: the assignment pipe %<>%. It acts as both a pipe and an assignment, letting you modify an object without creating a new one. You need to load the magrittr library explicitly, even if tidyverse is already loaded.

library(magrittr)
## Warning: package 'magrittr' was built under R version 4.3.3
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
kangoroos %<>% 
  mutate(Date = case_when(
    str_detect(Date, 'Sept') ~ str_replace(Date, 'Sept', 'Sep'),
               TRUE ~ Date),
    Date = parse_date_time(Date, orders = c('%m%d%y')),
    across(where(is.character), as.factor)
 )

kangoroos
## # A tibble: 3,774 × 7
##    ID    Sex     Age  Mass   Leg Date                Teeth
##    <fct> <fct> <dbl> <dbl> <dbl> <dttm>              <dbl>
##  1 1     M         6  41     595 2008-07-27 00:00:00   3  
##  2 1     M         8  46.2   613 2010-11-30 00:00:00   0.5
##  3 2     F        16  28.5   515 2008-07-27 00:00:00   3  
##  4 3     F         0  NA     212 2008-07-27 00:00:00  NA  
##  5 3     F         4  23.2   502 2012-09-09 00:00:00   2  
##  6 3     F         5  26     509 2013-09-03 00:00:00   2  
##  7 3     F         6  28.2   517 2014-09-10 00:00:00   1.5
##  8 3     F         7  28.5   528 2015-10-22 00:00:00   2  
##  9 3     F         9  29.5   533 2017-12-06 00:00:00   1.5
## 10 3     F        10  27.8   524 2018-09-12 00:00:00   1  
## # ℹ 3,764 more rows

3.2 Challenge N˚5

Using the summarise() command, find the mean Leg length and body mass for each sex at every age.

3.3 Plotting using ggplot

In this intro to ggplot2 we will look at mappings, layers, and scales. In ggplot(), you map your variables to what are called aesthetics, then add layers to generate a plot that uses those mappings. You can then add scales to further manipulate your plot.

We will map the data using aes(). Then we will choose the type of plot with a geom, for example geom_histogram(), geom_point(), geom_boxplot(), geom_col(), or geom_line(). We will make four plots; for each one we will learn new syntax, slowly adding layers to the plot.

3.3.1 Histograms

The simplest plot, which only uses an x or y aesthetic, to count frequencies.

kangoroos %>% 
  drop_na(Mass) %>% 
  ggplot(aes(x = Mass)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.3.2 Scatterplots

We’ll map another variable and add colour using the color = ... argument in geom_point().

kangoroos %>% 
  drop_na(Mass, Leg) %>% 
  ggplot(aes(x = Mass, y = Leg)) + 
  geom_point(color = "blue") 

3.3.3 Boxplots

We select colours using color = ... and fill = ... in geom_boxplot(), and add labels and titles using labs(title = "name", x = "name", y = "name").

kangoroos %>% 
  drop_na(Sex, Mass) %>% 
  ggplot(aes(x = Sex, y = Mass)) +
  geom_boxplot(color="black", fill = "orangered") + 
  labs(title = "Boxplot", x = "Sex", y = "Mass") 

3.3.4 Bar (column) plot

Selecting color using aes(fill = ...) and adding labels and titles.

kangoroos %>%
  group_by(Age, Sex) %>% 
  summarise(mean_mass = mean(Mass)) %>% 
  drop_na() %>% 
  ggplot(aes(x = Age, 
             y = mean_mass, 
             fill = Sex)
  ) +
  geom_col() + 
  labs(title = "Mean mass by age", x = "Age", y = "Mass") 
## `summarise()` has grouped output by 'Age'. You can override using the `.groups`
## argument.

By default, geom_col() stacks the observations on top of each other for each x-axis value. You can place them next to each other using geom_col(position = "dodge").

kangoroos %>%
  group_by(Age, Sex) %>% 
  summarise(mean_mass = mean(Mass)) %>% 
  drop_na() %>% 
  ggplot(aes(x = Age, 
             y = mean_mass, 
             fill = Sex)
  ) +
  geom_col(position = 'dodge') + 
  labs(title = "Mean mass by age", x = "Age", y = "Mass") 
## `summarise()` has grouped output by 'Age'. You can override using the `.groups`
## argument.

3.3.5 Faceting plots

Say you want to separate your plots by sex to look at the patterns separately. You can facet your plots by a categorical variable using facet_wrap() or facet_grid().

kangoroos %>% 
  drop_na() %>% 
  ggplot(aes(x = Leg,
             y = Mass, 
             colour = Age)) +
  geom_point() +
  facet_wrap(~Sex) +
  labs(title = "Mass versus leg length, coloured by age", x = "Leg length", y = "Mass") 

3.3.6 Adding points using jitter

We can add all data points on top of a boxplot using geom_jitter(). Notice what happens when you place the jitter before vs. after geom_boxplot(). We’ll also change the theme here, because that’s an option!

kangoroos %>% 
  drop_na() %>% 
  ggplot(aes(x = Sex, 
             y = Mass, 
             fill = Sex)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
  labs(x = "Sex", y = "Mass") +
  theme_light()
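
To see the layer-order effect, here is a sketch on simulated data with geom_jitter() placed before geom_boxplot(): the boxes are then drawn last and cover most of the points, because layer order is drawing order.

```r
library(ggplot2)
set.seed(1)
# Simulated data standing in for the kangoroos dataset
toy <- data.frame(Sex  = rep(c("F", "M"), each = 20),
                  Mass = c(rnorm(20, 28, 3), rnorm(20, 35, 4)))

p <- ggplot(toy, aes(Sex, Mass, fill = Sex)) +
  geom_jitter(width = 0.2) +  # drawn first, so it ends up underneath
  geom_boxplot()              # drawn last, painted over the points
p
```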

3.3.7 Line plots

geom_line() is the function used to make line graphs. Let’s put Age on the x axis and Mass on the y axis, group the lines by individual (ID), and colour them by Sex. We also add labels using labs().

kangoroos %>% 
  drop_na() %>% 
  ggplot(aes(x = Age,            
             y = Mass, 
             group = ID, 
             color = Sex)
  ) + 
  geom_line() + 
  labs(x = "Age", y = "Mass") 

That plot does not look good: each individual contributes its own line, producing a spaghetti plot. For reference, here is the (commented-out) tomato example where we reordered a discrete x axis with scale_x_discrete(), telling R to treat the rownames as numeric.

# kangoroos %>% 
#   ggplot(aes(x = rownames(tomato_mdf), 
#              y = total.area, group = Treatment, 
#              color = Treatment)
#   ) +
#   geom_line() +
#   scale_x_discrete(limits = sort(as.numeric(rownames(tomato_mdf)))) +  
#   labs(title = "Lineplot with Reordered Row Names", x = "Row Names", y = "Total Area")
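
Since that snippet depends on the tomato data, here is a self-contained sketch of the same idea on a toy data frame: a numeric-looking variable stored as text sorts alphabetically ("1", "10", "2", …), and converting it to numeric restores the natural order.

```r
library(ggplot2)
toy <- data.frame(day   = as.character(c(1, 10, 2, 20, 3)),  # text: sorts badly
                  value = c(5, 9, 6, 11, 7))

p <- ggplot(toy, aes(as.numeric(day), value)) +  # numeric x = natural order
  geom_line() +
  labs(x = "Day", y = "Value")
p
```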

3.4 Challenge N˚6

  1. Make a violin plot of Mass for each sex using geom_violin()
  2. Add the points on top of the violins
  3. Add labels
  4. Map the shape of the points to the Method variable
  5. Facet this plot by Method and Sex (try facet_grid())
  6. Change the theme (try them out!)
  7. Change the colours of the violins to “darkred” and “midnightblue” (hint: check the scale_fill_manual() function)
  8. Save your plot using the ggsave() function as a 1400 x 1200 pixels file in .png format
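
For item 8, here is a hedged sketch of ggsave() (the filename is a placeholder). Recent ggplot2 versions accept units = "px" directly; to stay version-safe, the sketch converts pixels to inches through the dpi instead (1400 px at 300 dpi = 1400/300 inches).

```r
library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()  # any plot object

# 1400 x 1200 pixels at 300 dpi, expressed in inches
ggsave("my_violin_plot.png", plot = p,
       width = 1400 / 300, height = 1200 / 300, units = "in", dpi = 300)
```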

4 Session 4 - Tuesday PM

4.1 Setup: load packages

library(ggplot2)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(tidyr)
library(dplyr)
library(gapminder) # For demo dataset

Objectives

By the end of this session, you will:

  1. Understand ggplot2’s theme system
  2. Apply built-in and custom themes
  3. Combine multiple plots with patchwork
  4. Create publication-quality figure layouts

4.2 ggplot2 Themes

You can find all the possible ggplot2 themes here: https://ggplot2.tidyverse.org/reference/ggtheme.html

data <- gapminder %>% filter(year == 2007)

p <- data %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  labs(title = "GDP vs Life Expectancy",
       x = "GDP per capita",
       y = "Life Expectancy")

# Demonstrate theme options
p + theme_dark()

p + theme_gray()     # Default

p + theme_bw()       # Black-and-white

p + theme_minimal()  # Clean minimal

p + theme_classic()  # Traditional

p + theme_void()     # Empty canvas

4.3 Customizing your theme

You can modify virtually every aspect of your figures. There are numerous resources online for browsing colours in R, but here is one that I like to use: https://colorbrewer2.org/#type=sequential&scheme=Reds&n=3.

p + theme(
  plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
  panel.background = element_rect(fill = "aliceblue"),
  legend.position = "bottom",
  axis.text = element_text(color = "darkblue")
)

Facet wraps

Now we will build this boxplot figure step by step to learn how to incrementally construct a ggplot2 figure.

Let’s first reload the tomato (clean) data and relabel the factors to practice our newly learned syntax.

#Load the data
tomato<-read_xlsx("data/Rocio_tomato_modified.xlsx",sheet=2)
#Prepare the relabel function for both variables for clean plotting
relabel.Treatement = as_labeller(c(mock = "Mock", 
                   treated = "Treated"))
relabel.Infection = as_labeller(c(pathogen = "Pathogen", 
                   water = "Water"))
#Transform our dataset with the %<>%
tomato %<>%
  rowwise() %>% 
  mutate(Treatement = factor(relabel.Treatement(Treatement))) %>%
  mutate(Infection = factor(relabel.Infection(Infection)))

Okay, let’s start to build our plot!

tomato %>%
  ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
  geom_boxplot(aes(fill=Infection))

Let’s add the facet!

(first_plot <- tomato %>%
  ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
  geom_boxplot(aes(fill=Infection))+
  facet_wrap(~Treatement))

Great, now we can change our theme to the dark side!

tomato %>%
  ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
  geom_boxplot(aes(fill=Infection))+
  facet_wrap(~Treatement)+
  theme_dark()

Let’s add a jitter to see each observation better.

tomato %>%
  ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
  geom_boxplot(aes(fill=Infection))+
  facet_wrap(~Treatement)+
  theme_dark()+
  geom_jitter(width = 0.1,aes(fill=Treatement))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Let’s change the colours and make the shape of our points fillable.

tomato %>%
  ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
  geom_boxplot(aes(fill=Infection))+
  facet_wrap(~Treatement)+
  theme_dark()+
  geom_jitter(width = 0.1,aes(fill=Treatement))+
  scale_fill_manual(values = c("#69B3A2","#de2d26","white","white"))+
  scale_shape_manual(values = c(21,21))+
  ylab(label="Leaf chlorotic area (%)")+
  guides(fill=F,shape=F)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Perfect! Now let’s save the finished plot, with its publication-ready Y label and no legend, in an object so we can reuse it later.

(final_plot <- tomato %>%
  ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
  geom_boxplot(aes(fill=Infection))+
  facet_wrap(~Treatement)+
  theme_dark()+
  geom_jitter(width = 0.1,aes(fill=Treatement))+
  scale_fill_manual(values = c("#69B3A2","#de2d26","white","white"))+
  scale_shape_manual(values = c(21,21))+
  ylab(label="Leaf chlorotic area (%)")+
  guides(fill=F,shape=F))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

4.4 Challenge 7

Create your own version of this plot, but changing the theme, the colours, the shape of the points, the X and Y labels, and adding a main title.

4.5 Patchwork

Patchwork is an amazing package that allows you to create perfect multi-plot figures for your publications.

Here is an overview of the possibilities:

p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p2 <- ggplot(mpg, aes(class)) + geom_bar()
p3 <- ggplot(mpg, aes(cty)) + geom_histogram()

# Horizontal layout
p1 + p2

# Vertical layout
p1 / p2

# Grid layout
(p1 | p2) / p3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here are some additional options:

# Add title to composition
(p1 + p2) + plot_annotation(title = "My Dashboard")

# Control spacing and guides
(p1 + p2 + p3) + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

4.6 Challenge 8

Using the boxplot figures that we created from the tomato dataset, combine three different plots (the first one and the final one that we created together, plus the one you modified):

  1. Combine them into a 1x3 horizontal layout
  2. Add a shared title “Walk through a tomato plotting contest”
  3. Label subplots as A, B, C

4.7 Super challenge !

Using Marco’s dataset:

  1. Dynamically create a vector containing the top 3 most common capture methods
  2. Correct the Method variable so that pole is replaced by Pole and roadkill by Roadkill (both lowercase forms appear in the data). You can try the str_to_title() function, which will also take care of RoadKill!
  3. Create a density ridge plot showing the distribution of masses for each age, in a way that allows us to distinguish each method and compare both sexes visually.
  4. Does one of these facets annoy you? Find a creative way to visualise it better.

4.7.1 Partial solution (we’ll finish tomorrow)

library(magrittr)
library(ggridges)

kang_roo <- read_xlsx('data/Marco_kangoroos.xlsx')  %>% 
  select(ID, 
         Sex, Age, Mass, Leg, Method, Teeth,
         Date) %>% 
  mutate(ID = as.factor(ID),
         Sex = as.factor(Sex),
         Age = as.numeric(Age)
         ) 
## Warning: Expecting date in L1160 / R1160C12: got 'NA'
## Warning: Coercing text to numeric in Q1411 / R1411C17: '1.5'
## Warning: Coercing text to numeric in R1750 / R1750C18: '91.7'
## Warning: Expecting date in L2453 / R2453C12: got '20;08'
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
age_levels <- 
  kang_roo %>% 
  pull(Age) %>% 
  unique()

kang_roo %<>% 
  mutate(Method = str_to_sentence(Method)) 

these_methods <- kang_roo %>% 
  group_by(Method) %>% 
  summarise(obs.count = n()) %>% 
  arrange(desc(obs.count)) %>% 
  pull(Method) %>% 
  head(3)

kang_roo %>% 
  filter(Method %in% these_methods)
## # A tibble: 3,581 × 8
##    ID    Sex     Age  Mass   Leg Method Teeth Date      
##    <fct> <fct> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>     
##  1 1     M         8  46.2 613   Pole     0.5 Nov 30 10 
##  2 3     F         0  NA   212   Py      NA   Jul 27 08 
##  3 3     F         4  23.2 502   T-pole   2   Sept 9 12 
##  4 3     F         5  26   509   T-pole   2   Sept 3 13 
##  5 3     F         6  28.2 517   T-pole   1.5 Sept 10 14
##  6 3     F         7  28.5 528   T-pole   2   Oct 22 15 
##  7 3     F         9  29.5 533   T-pole   1.5 Dec 6 17  
##  8 3     F        10  27.8 524   T-pole   1   Sept 12 18
##  9 4     F        NA  28   522   Pole     1   Aug 12 08 
## 10 5     F         0  NA    95.4 Py      NA   Aug 12 08 
## # ℹ 3,571 more rows

5 Session 5 - Wednesday AM

5.1 Set your working directory

The functions setwd() and getwd() are your friends.

Navigate in the Files panel (use the blue gear menu) and make sure that you see your working directory.
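
A minimal sketch (the path is a placeholder; substitute your own project folder):

```r
getwd()                  # print the current working directory
# setwd("~/r_workshop")  # hypothetical path: point R at your project folder
list.files()             # check that your data files are visible from here
```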

5.2 Load the packages you need

library(tidyverse)
library(readxl)

5.3 Look at your data

To start, we will load the tomato data and look at it.

tomato<-read_xlsx("data/Rocio_tomato_modified.xlsx",sheet=2)
tomato$Infection<-as.factor(tomato$Infection)
tomato$Infection[which(tomato$Infection=="pathogn")]<-"pathogen"
tomato$Infection<-droplevels(tomato$Infection)
tomato <- tomato %>%
  mutate(ratio = chlorotic.area / total.area)

pathogen_only <- tomato %>%  
  filter(Infection == "pathogen")

What should we do first?

hist(pathogen_only$ratio)

hist(pathogen_only$total.area)

hist(pathogen_only$chlorotic.area)

5.4 Normality

Then we could check for normality:

shapiro.test(pathogen_only$ratio)
## 
##  Shapiro-Wilk normality test
## 
## data:  pathogen_only$ratio
## W = 0.85178, p-value = 0.002359

OH MY GOD a p-value, what does this mean? Because it is <0.05, we can reject our H0 that the data is distributed normally. Thus, our data does not appear to follow a normal distribution.

Here is a different way to perform normality tests on all the variables you want:

variables_to_test <- c("chlorotic.area", "total.area", "ratio")

# Loop over variables and apply Shapiro-Wilk test
shapiro_results <- tomato %>%
  select(all_of(variables_to_test)) %>%  # Select only the target variables
  map(~ shapiro.test(.x)) %>%            # Apply Shapiro-Wilk test to each column
  map_dfr(~ broom::tidy(.x), .id = "variable")  # Convert results to a tidy data frame

# Print results
print(shapiro_results)
## # A tibble: 3 × 4
##   variable       statistic      p.value method                     
##   <chr>              <dbl>        <dbl> <chr>                      
## 1 chlorotic.area     0.731 0.0000000494 Shapiro-Wilk normality test
## 2 total.area         0.945 0.0250       Shapiro-Wilk normality test
## 3 ratio              0.731 0.0000000493 Shapiro-Wilk normality test

5.5 Homoskedasticity

The second assumption is equal variance in our data; let’s test it:

var.test(ratio~Treatement,data=pathogen_only)
## 
##  F test to compare two variances
## 
## data:  ratio by Treatement
## F = 1.9132, num df = 11, denom df = 11, p-value = 0.297
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5507728 6.6459465
## sample estimates:
## ratio of variances 
##           1.913219

Okay, let’s imagine our data is normal. What is the next step? Look at our data again:

boxplot(ratio~Treatement,data=pathogen_only) # a quick plot

pathogen_only %>% ggplot(aes(Treatement, ratio, fill=Treatement, shape=Treatement))+
  geom_boxplot()+geom_jitter(width=0.1,aes(fill="white"),size=3)+theme_classic()+
  scale_shape_manual(values=c(21,21))+
  scale_fill_manual(values = c("#69B3A2","#de2d26","white"))+
  labs(title="Ratio of chlorotic and total areas impacted by treatment",x="Treatment",y="Ratio")+
  guides(fill="none",color="none",shape="none")

In this plot, our H0 is that there is no difference in the mean of the Y variable between the two groups. Our H1 could be that there is a difference (less or more), or we could already have hypothesized that the treated group will have a lower ratio because of literature or our previous knowledge. Thus, we could expect significant results, no? So let’s run a t-test:

t.test(ratio~Treatement,data=pathogen_only) # R defaults to the unequal-variance Welch two-sample t-test
## 
##  Welch Two Sample t-test
## 
## data:  ratio by Treatement
## t = 2.993, df = 20.032, p-value = 0.007179
## alternative hypothesis: true difference in means between group mock and group treated is not equal to 0
## 95 percent confidence interval:
##  0.009977977 0.055854593
## sample estimates:
##    mean in group mock mean in group treated 
##            0.05972684            0.02681056
t.test(ratio~Treatement,data=pathogen_only, var.equal=T) # default is two-sided
## 
##  Two Sample t-test
## 
## data:  ratio by Treatement
## t = 2.993, df = 22, p-value = 0.006702
## alternative hypothesis: true difference in means between group mock and group treated is not equal to 0
## 95 percent confidence interval:
##  0.01010862 0.05572395
## sample estimates:
##    mean in group mock mean in group treated 
##            0.05972684            0.02681056
t.test(ratio~Treatement,data=pathogen_only,alternative = "less",var.equal=T)
## 
##  Two Sample t-test
## 
## data:  ratio by Treatement
## t = 2.993, df = 22, p-value = 0.9966
## alternative hypothesis: true difference in means between group mock and group treated is less than 0
## 95 percent confidence interval:
##        -Inf 0.05180078
## sample estimates:
##    mean in group mock mean in group treated 
##            0.05972684            0.02681056
t.test(ratio~Treatement,data=pathogen_only,alternative = "greater",var.equal=T)
## 
##  Two Sample t-test
## 
## data:  ratio by Treatement
## t = 2.993, df = 22, p-value = 0.003351
## alternative hypothesis: true difference in means between group mock and group treated is greater than 0
## 95 percent confidence interval:
##  0.01403179        Inf
## sample estimates:
##    mean in group mock mean in group treated 
##            0.05972684            0.02681056

Let’s now run the non-parametric equivalent, the Mann-Whitney-Wilcoxon Test:

wilcox.test(ratio~Treatement,data=pathogen_only)
## 
##  Wilcoxon rank sum exact test
## 
## data:  ratio by Treatement
## W = 124, p-value = 0.00183
## alternative hypothesis: true location shift is not equal to 0

Still significant: we can reject our H0. It seems like the means of the two groups are not equal. Did we demonstrate H1? No! Rejecting H0 only tells us the data are unlikely under H0; it does not prove any particular alternative.

5.6 Challenge N˚9

  1. Create a water_only dataset
  2. Look at the histogram of the ratio variable
  3. Check for the normality and homoskedasticity of the ratio variable
  4. Perform a t.test and a Mann-Whitney-Wilcoxon test

5.7 Install more packages!

library(pacman)
p_load(tidyverse, 
       ggpubr, 
       FSA, 
       dunn.test, 
       rstatix,
       car,
       readxl)

5.8 Comparing the mean across more than two groups

Let’s use the Moss data set and plot the Nitrogen fixation (ARA_C2H4) across sampling months in 2017 independently for each moss species.

# Load data
Moss <- read_excel("data/Marie_Mosses.xlsx")

# Transform some variables as factors
Moss$Year <- as.factor(Moss$Year)
Moss$Months <- factor(Moss$Months, levels = c("June","September","October"))

# Subset to keep only data from 2017
Moss2017 <- Moss[Moss$Year=="2017",]

# Subset to keep only data from one moss species
Moss2017_PCC <- Moss2017[Moss2017$Species == "Ptilium crista-castrensis",]

# Plot it
ggplot(data = Moss2017_PCC, aes(x = Months, y = ARA_C2H4, fill = Months))+
  geom_boxplot()+
  theme_bw()+
  scale_fill_manual(values = c("green","green4", "darkgreen"))

We want to evaluate if there is a significant difference in the nitrogen fixation across time. Therefore, we want to compare the mean of 3 groups (June, September, October).

5.8.1 ANOVA (parametric)

Analysis of Variance

5.8.1.1 Check the conditions to apply ANOVA

  1. Independent observations
  2. Normality (shapiro.test()):
shapiro.test(Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "June"])
## 
##  Shapiro-Wilk normality test
## 
## data:  Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "June"]
## W = 0.89578, p-value = 0.387
# p>0.05: data is normal

shapiro.test(Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "September"])
## 
##  Shapiro-Wilk normality test
## 
## data:  Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "September"]
## W = 0.95196, p-value = 0.7512
# p>0.05: data is normal

shapiro.test(Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "October"])
## 
##  Shapiro-Wilk normality test
## 
## data:  Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "October"]
## W = 0.93141, p-value = 0.606
# p>0.05: data is normal
  3. Homogeneous variance (leveneTest()):
leveneTest(Moss2017_PCC$ARA_C2H4 ~ Moss2017_PCC$Months)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2    1.08 0.3704
##       12
# p>0.05: variances are not significantly different

5.8.1.2 Perform ANOVA

# Using anova()
res <- anova(lm(Moss2017_PCC$ARA_C2H4 ~ Moss2017_PCC$Months))
res # No difference between groups (Months)
## Analysis of Variance Table
## 
## Response: Moss2017_PCC$ARA_C2H4
##                     Df Sum Sq Mean Sq F value Pr(>F)
## Moss2017_PCC$Months  2 1081.9  540.95  1.6143 0.2394
## Residuals           12 4021.3  335.11
# Using aov()
res2 <- aov(Moss2017_PCC$ARA_C2H4 ~ Moss2017_PCC$Months)
summary(res2) # No difference between groups (Months)
##                     Df Sum Sq Mean Sq F value Pr(>F)
## Moss2017_PCC$Months  2   1082   540.9   1.614  0.239
## Residuals           12   4021   335.1

According to the ANOVA, there is no difference between sampling months.

5.8.2 Tukey test (parametric post-hoc)

If the ANOVA had revealed a difference, we would have performed a Tukey test (parametric post-hoc) to find out between which groups the difference lies.

# Difference between which groups?
TukeyHSD(res2) # No difference between groups (Months)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Moss2017_PCC$ARA_C2H4 ~ Moss2017_PCC$Months)
## 
## $`Moss2017_PCC$Months`
##                        diff       lwr      upr     p adj
## September-June    13.779040 -17.10865 44.66673 0.4809339
## October-June      20.386686 -10.50101 51.27438 0.2239070
## October-September  6.607645 -24.28005 37.49534 0.8379383

5.8.3 Kruskal-Wallis (non-parametric)

Let’s use the flowers data set and plot the alpha-diversity for each site (code).

# Import dataset
flowers <- read_excel("data/meta.flower.2022.xlsx")
# Transform the variable code as factor
flowers$code <- as.factor(flowers$code)

# Plot alpha-diversity for each site (code)
(plot.flower <- ggplot(flowers, aes(code, alphadiv, fill = code))+
  geom_boxplot()+
    theme_bw())

5.8.3.1 Test normality

shapiro.test(flowers$alphadiv[flowers$code == "A"])
## 
##  Shapiro-Wilk normality test
## 
## data:  flowers$alphadiv[flowers$code == "A"]
## W = 0.96274, p-value = 0.8265
# p>0.05: data is normal
shapiro.test(flowers$alphadiv[flowers$code == "B1"])
## 
##  Shapiro-Wilk normality test
## 
## data:  flowers$alphadiv[flowers$code == "B1"]
## W = 0.74743, p-value = 0.005061
# p<0.05: data is NOT normal
shapiro.test(flowers$alphadiv[flowers$code == "C"])
## 
##  Shapiro-Wilk normality test
## 
## data:  flowers$alphadiv[flowers$code == "C"]
## W = 0.82122, p-value = 0.03557
# p<0.05: data is NOT normal

5.8.3.2 Perform Kruskal test

kruskal_test(flowers, alphadiv~code)
## # A tibble: 1 × 6
##   .y.          n statistic    df        p method        
## * <chr>    <int>     <dbl> <int>    <dbl> <chr>         
## 1 alphadiv    27      17.7     2 0.000145 Kruskal-Wallis
# there is a significant difference between some groups

5.8.4 Dunn test (non-parametric post-hoc)

dunn_test(flowers, alphadiv~code)
## # A tibble: 3 × 9
##   .y.      group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
## * <chr>    <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 alphadiv A      B1         9     9     0.564 0.573    0.573    ns          
## 2 alphadiv A      C          9     9    -3.33  0.000881 0.00176  **          
## 3 alphadiv B1     C          9     9    -3.89  0.000100 0.000301 ***
# Differences are between groups A-C and B1-C
# No difference between groups A-B1

Let’s add the statistical results to the plot.

# Store dunn test result in an object
dunn.res <- dunn_test(flowers, alphadiv~code)
# Manually extract p-values from dunn test results
p_values <- dunn.res$p.adj.signif[c(1, 2, 3)]  
# Find x and y positions for the p-value
stat.flower <- add_xy_position(dunn.res, x = "code")
# List of comparison
comparisons <- list(c("A", "B1"), c("A", "C"), c("B1", "C"))
# Add on the plot
plot.flower+
  geom_signif(comparisons = comparisons,
              annotations = p_values,
              y_position = stat.flower$y.position) 

To conclude, if we want to compare the mean between more than two groups, we first need to know whether to use a parametric (ANOVA) or a non-parametric (Kruskal-Wallis) test. So we first test the normality of all groups using shapiro.test() (p<0.05: not normal; p>0.05: normal) and the homogeneity of the variance using leveneTest().

Depending on the results, we perform an ANOVA or a Kruskal-Wallis test. If the result of the ANOVA is significant, we then perform a post-hoc Tukey test. If the result of the Kruskal-Wallis test is significant, we then perform a post-hoc Dunn test.
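
This workflow can be sketched on simulated data (the object names y, g, and df are placeholders):

```r
set.seed(42)
# Simulated response y with three groups g
df <- data.frame(y = c(rnorm(10, 0), rnorm(10, 1), rnorm(10, 2)),
                 g = factor(rep(c("A", "B", "C"), each = 10)))

# Parametric route only if every group looks normal
p_norm <- tapply(df$y, df$g, function(v) shapiro.test(v)$p.value)
if (all(p_norm > 0.05)) {
  res <- aov(y ~ g, data = df)
  print(summary(res))
  print(TukeyHSD(res))          # post-hoc, if the ANOVA is significant
} else {
  print(kruskal.test(y ~ g, data = df))
  # then a Dunn test (e.g. rstatix::dunn_test) if significant
}
```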

5.9 Correlation between two variables

5.9.1 Pearson & Spearman

Pearson = parametric
Spearman = non-parametric

Test normality:

shapiro.test(Moss2017$Fe_ppm)
## 
##  Shapiro-Wilk normality test
## 
## data:  Moss2017$Fe_ppm
## W = 0.76572, p-value = 1.652e-05
# p<0.05: Not normal
shapiro.test(Moss2017$P_ppm)
## 
##  Shapiro-Wilk normality test
## 
## data:  Moss2017$P_ppm
## W = 0.97159, p-value = 0.5834
# p>0.05 Normal
shapiro.test(Moss2017$CN_Ratio)
## 
##  Shapiro-Wilk normality test
## 
## data:  Moss2017$CN_Ratio
## W = 0.96198, p-value = 0.3476
# p>0.05 Normal
shapiro.test(Moss2017$`N_mg.g-1`)
## 
##  Shapiro-Wilk normality test
## 
## data:  Moss2017$`N_mg.g-1`
## W = 0.96918, p-value = 0.517
# p>0.05 Normal

Perform the correlation test:

# Pearson (parametric)
plot(Moss2017$P_ppm,Moss2017$CN_Ratio)

cor.test(Moss2017$P_ppm, Moss2017$CN_Ratio, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  Moss2017$P_ppm and Moss2017$CN_Ratio
## t = -2.3966, df = 28, p-value = 0.02347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.67283141 -0.06143071
## sample estimates:
##        cor 
## -0.4125691
# p<0.05: Reject H0 -> there is a correlation (r = -0.4)

# Spearman (non-parametric)
plot(Moss2017$Fe_ppm, Moss2017$P_ppm)

cor.test(Moss2017$Fe_ppm, Moss2017$P_ppm, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  Moss2017$Fe_ppm and Moss2017$P_ppm
## S = 6922, p-value = 0.002405
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5399333
# p<0.05: Reject H0 -> there is a correlation (r = -0.5)

# Pearson (parametric)
plot(Moss2017$`N_mg.g-1`, Moss2017$CN_Ratio)

cor.test(Moss2017$`N_mg.g-1`, Moss2017$CN_Ratio, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  Moss2017$`N_mg.g-1` and Moss2017$CN_Ratio
## t = -23.305, df = 28, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9882499 -0.9479519
## sample estimates:
##        cor 
## -0.9751798
# p<0.05: Reject H0 -> there is a correlation (r = -0.98)

5.9.1.1 Add correlation coefficient and p-value to the plot:

ggplot(Moss2017, aes(`N_mg.g-1`, CN_Ratio))+
  geom_point()+
  theme_bw()+
  geom_smooth(method = "lm")+
  stat_cor(method = "pearson", label.x = 9, label.y = 63)
## `geom_smooth()` using formula = 'y ~ x'

5.10 Challenge N˚10

Test if there is a correlation between two other variables from the moss data set
- Test normality of both variables
- Perform appropriate correlation test
- Plot it and add correlation coefficient and p-value

6 Session 6 - Wednesday PM

6.1 Import Data:

GR <- read_excel("data/Green_Roof.xlsx")

For this analysis we want to know how roof age (the explanatory variable) influences the response variable plant cover (Moss, Suc). Let’s start with Moss; we will need to filter out all other cover types.

Cut<- GR %>% filter(Spp == "Moss")

6.1.1 Check Normality of Moss

Use shapiro.test() to get the Moss cover as close to normality as possible by trying common transformations (reciprocal, logarithm, cube root, square root, and square).

shapiro.test(log10(Cut$Cover))
## 
##  Shapiro-Wilk normality test
## 
## data:  log10(Cut$Cover)
## W = 0.94629, p-value = 0.3143
hist(log10(Cut$Cover))
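
To try all the suggested transformations at once, here is a sketch on a simulated right-skewed variable (a higher Shapiro-Wilk p-value means closer to normal):

```r
set.seed(7)
x <- rlnorm(30)  # simulated right-skewed data standing in for Cut$Cover

# Candidate transformations, compared by their Shapiro-Wilk p-value
transforms <- list(raw = x, log = log10(x), sqrt = sqrt(x),
                   cubert = x^(1/3), recip = 1/x, square = x^2)
pvals <- sapply(transforms, function(v) shapiro.test(v)$p.value)
round(pvals, 4)
```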

6.1.2 Run Regression

Here moss will be the response variable and age will be the explanatory variable. lm() is the function for simple regression.

lm(formula = log10(Cover) ~ Age, data = Cut)
## 
## Call:
## lm(formula = log10(Cover) ~ Age, data = Cut)
## 
## Coefficients:
## (Intercept)          Age  
##     1.24809      0.01137

This code works, but it only gives us the coefficients. To get more details, let’s save it in an object and use the summary() function:

Reg<-lm(formula = log10(Cover) ~ Age, data = Cut)
summary(Reg)
## 
## Call:
## lm(formula = log10(Cover) ~ Age, data = Cut)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09620 -0.20377  0.02189  0.30202  0.54444 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.24809    0.20464   6.099 9.21e-06 ***
## Age          0.01137    0.01372   0.829    0.418    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4522 on 18 degrees of freedom
## Multiple R-squared:  0.03678,    Adjusted R-squared:  -0.01673 
## F-statistic: 0.6873 on 1 and 18 DF,  p-value: 0.418
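
To pull individual numbers out of a fitted model (handy for annotating plots, e.g. in Challenge N˚11), here is a sketch on the built-in cars dataset as a stand-in:

```r
fit <- lm(dist ~ speed, data = cars)     # built-in dataset as a stand-in
coef(fit)                                # intercept and slope
r2 <- summary(fit)$r.squared             # R-squared
p_slope <- summary(fit)$coefficients["speed", "Pr(>|t|)"]  # slope p-value
round(c(r2 = r2, p = p_slope), 4)
```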

6.1.3 Plot it

Now let’s make a plot of our regression using ggplot2.

ggplot(Cut, aes(x = Age, y = Cover)) +
  geom_point(size = 2) +
  geom_smooth(method = lm, color = "red", se = FALSE) +
  theme_bw()  # add the theme to this plot; theme_set() would change it globally
## `geom_smooth()` using formula = 'y ~ x'

6.2 Challenge N˚11

Create a plot with two regression lines, Moss and Suc, and write the R² value on the plot.

ggplot(GR, aes(x = Age, y = Cover, color = Spp)) +
  geom_point(size = 2) +
  geom_smooth(method = lm, se = FALSE) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'